{
 "metadata": {
  "name": "",
  "signature": "sha256:16d71f083f9324160903742a12d5e0db7dd591e35de726792d846ab23247b5e4"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# k Nearest Neighbors \n",
      "We are going to learn how to label some unlabled data based on *similar* characterisitcs to labled data. In terms of vocabulary, we are going to create a classifier for a specific target variable using feature comparisons. \n",
      "\n",
      "The starting point is the labeled dataset with various features. In this n-dimensional space, we then caluculate the distance between our unknown datum and it's k closests neighbors. \n",
      "\n",
      "The process has three main parts:\n",
      "\n",
      "- Optimize k\n",
      "- Distance calculation\n",
      "- Lablel logic\n",
      "\n",
      "This tutorial is uses two main resources:\n",
      " - [Machine Learning In Action](http://www.manning.com/pharrington/)\n",
      " - [Scikit-learn KNeighborsClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)\n",
      "\n",
      "### Optimize k\n",
      "A common question: Since k is an input, how do we choose the size of k?\n",
      "\n",
      "We can optimize k before we use kNN. This technique usually involves something like n-folds and testing various values of k, which let's us compare the mean error rates and choose \"the best\" k-value.   \n",
      "\n",
      "An alternative approach: optimize k while we use kNN. What if we let k vary (for each unknown) depending on the distribution of distances? A simple application would be to calculate all distances and k to be the number of neighbors within 1 standard deviation from the mean distance. \n",
      "\n",
      "In summary, optimizing k is a great idea. I don't have a great answer for you, but many people seem to like n-folds cross validation [[reference paper](http://lshtc.iit.demokritos.gr/system/files/XiaogangHan.pdf)].  \n",
      "\n",
      "### Distance calculation\n",
      "We can use euclidean distance:\n",
      "\n",
      "$ \\text{distance = }\n",
      "    \\sum \\limits_{i=1}^{n} \n",
      "    \\left(a_i - b_i \\right)^2 $\n",
      "    \n",
      "Alternatively, we could measure distance using something like using curved space. The scikit-learn package has [various options](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html#sklearn.neighbors.DistanceMetric).\n",
      "\n",
      "(*add example code*)\n",
      "\n",
      "The method for compute distance is an important choice since it directly quantifies the values used in the logic to determine \"nearness\" of the neighbors. \n",
      "\n",
      "This computation can also be the source of major memory issues. If we can recognize those points \"closer\" to the unknown without computing and storing distances for the entire dataset, then we can drastically improve the performace of kNN.  \n",
      "\n",
      "### Label Logic\n",
      "Several questions can be considered while building the logic of labeling the unknown using kNN:\n",
      "- When a tie vote occors, which label should be selected?\n",
      "- Should those lables of *closer* nearest neighbors be weighted? \n",
      "- How should votes be counted? \n",
      "- Is the return value a confidence score for each label? \n",
      "\n",
      "A typical technique is to organize the k neariest neighbors labels as votes towards a certain label. The candidate with the most votes could win. \n",
      "\n",
      "However, the decision making behind the label returned is entirely up to the author of the technique. \n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "import operator\n",
      "\n",
      "def createDataSet():\n",
      "    \n",
      "    # features: [boolean for \"poops in a litter box\", average number of \"morning meows\", average number of \"kisses per greeting\"]\n",
      "    dataSet = np.array([ [1.0,1.1,0],[1.0,1.0,1],[0,0,20],[0,0.1,3] ])\n",
      "    \n",
      "    #labels = [0,0,1,1]\n",
      "    labels = np.array(['CAT','CAT','DOG','DOG'])\n",
      "    \n",
      "    return dataSet, labels\n",
      "\n",
      "dataSet, labels = createDataSet()\n",
      "print \"dataSet: \\n{}\".format(dataSet)\n",
      "print\n",
      "print \"labels: \\n{}\".format(labels)\n",
      "print \n",
      "print \"group.shape: \\n{}\".format(dataSet.shape)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### ML in Action\n",
      "We can build our own custom kNN classifier based on code similar to that in [Machine Learning In Action](http://www.manning.com/pharrington/)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def kNN(unknown,dataSet,labels,k):\n",
      "    \"\"\"\n",
      "    Apply kNN algorithm\n",
      "    \"\"\"\n",
      "    \n",
      "    # get length of dataset\n",
      "    dataSetSize = dataSet.shape[0]\n",
      "    \n",
      "    # tile creates a matrix the same size as dataSet with repeated rows of the unknown\n",
      "    diffMat = np.tile(unknown, (dataSetSize,1)) - dataSet\n",
      "    \n",
      "    # DISTANCE CALCULATION (eucledian distance from unknown to labeled points)\n",
      "    sqDiff = diffMat ** 2\n",
      "    sqDistances = sqDiff.sum(axis=1)\n",
      "    distances = sqDistances ** 0.5\n",
      "    \n",
      "    # sort the index of distances  \n",
      "    sortedDistIndex = distances.argsort()\n",
      "    \n",
      "    # LABEL LOGIC (here we use the shortest distance and votes are summed up to define the label)\n",
      "    classCount = {}\n",
      "    for i in range(k):\n",
      "        # grabs the labels the of the k nearest neighbors\n",
      "        voteLabel = labels[sortedDistIndex[i]]\n",
      "        \n",
      "        # tally the labels like votes\n",
      "        classCount[voteLabel] = classCount.get(voteLabel,0) + 1\n",
      "        \n",
      "    # sorts the tally (note: if it's a tie, then what?)\n",
      "    sortedClassCount = sorted(classCount.iteritems(), key = operator.itemgetter(1), reverse = True)\n",
      "    print \"tally: \\n{}\\n\".format(sortedClassCount)\n",
      "    \n",
      "    # return the label with the most votes\n",
      "    return sortedClassCount[0][0]\n",
      "    \n",
      "        \n",
      "        "
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# use the tally to lable the unknown\n",
      "print 'CUSTOM classification: \\n{}'.format(kNN([0,0,2], dataSet, labels,3))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### scikit-learn \n",
      "We can use the [KNeighborsClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) from scikit-learn to accomplish this task as well, but we have several algorithms from which we can choose. The following choices allow us to consider how we calculate the nearest neighbors. \n",
      "\n",
      "- `BallTree1` - decision making technique for determining which distances to calculate.\n",
      "- `KDTree1` - decision making technique for determining which distances to calculate.\n",
      "- `Brute` - Calculate all distances between the unknown and the known labels. \n",
      "- `Auto` - Atempts to choose the best algorithm.\n",
      "\n",
      "\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# use scikit-learn to classfiy\n",
      "from sklearn import datasets\n",
      "from sklearn.neighbors import KNeighborsClassifier\n",
      "\n",
      "# Create the kNN object \n",
      "\n",
      "knn = KNeighborsClassifier(n_neighbors=3)\n",
      "\n",
      "# Use the known data\n",
      "knn.fit(dataSet, labels) \n",
      "\n",
      "# Predict the unknown\n",
      "print 'Scikit-learn classification: \\n{}'.format(knn.predict(np.array([0,0,1])))\n",
      "\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Ideas for improvements (fork me!):\n",
      "- k-folds example\n",
      "- real/practical example using kNN\n",
      "- explain scikit-learn algorithms for calculating the nearest neighbor (explain why to use specific algorithms with examples)\n",
      "- add details for the distance issue (ie. calculation distance for the entire dataset)\n",
      "- What's the break point for using kNN brute force vs other algorithms? \n",
      "- Normalize the feature space (possibly a full on RST)\n",
      "\n"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}